Predicting the age of abalone by using only physical measurements

Group 005E06

Reid Wang, Ivan Yeh, Christy Lee, Paihao Zhang and Honghui Huang

Introduction to the Data

  • Sourced from the UCI ML Repo (Nash et al., 1995)
  • Consists of measurements taken from 4177 abalone
  • The continuous variables are stored divided by 200; they were multiplied by 200 at this step to restore the original units (mm and g).
| Field Name | Data Type | Data Format | Length | Description | Example |
|---|---|---|---|---|---|
| Sex | Character | X | 1 | Sex of abalone | M, F, I |
| Length | Float | N.NNN | 4 | Length in mm \(\div\) 200 | 0.455 |
| Diameter | Float | N.NNN | 4 | Diameter in mm \(\div\) 200 | 0.365 |
| Height | Float | N.NNN | 4 | Height in mm \(\div\) 200 | 0.095 |
| Whole_weight | Float | N.NNNN | 5 | Whole weight in g \(\div\) 200 | 0.5140 |
| Shucked_weight | Float | N.NNNN | 5 | Weight of meat in g \(\div\) 200 | 0.2245 |
| Viscera_weight | Float | N.NNNN | 5 | Gut weight in g \(\div\) 200 | 0.1010 |
| Shell_weight | Float | N.NNNN | 4 | Dry shell weight in g \(\div\) 200 | 0.150 |
| Rings | Integer | NN | 2 | No. of rings (+1.5 gives age in years) | 15 |
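
As a quick illustration of the table, the sketch below converts one stored value back to physical units and turns a ring count into an age. The function names are our own, chosen for illustration; only the factor of 200 and the +1.5 offset come from the data description.

```python
SCALE = 200  # the dataset stores continuous measurements divided by 200


def to_physical_units(stored_value: float) -> float:
    """Undo the division by 200 (gives mm for lengths, g for weights)."""
    return stored_value * SCALE


def rings_to_age(rings: int) -> float:
    """Age in years is the ring count plus 1.5."""
    return rings + 1.5


# Example values taken from the "Example" column of the table
length_mm = to_physical_units(0.455)
age_years = rings_to_age(15)
print(length_mm, age_years)
```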

Introduction to the Problem

  • Goal is to predict the age of abalone using physical measurements
  • This speeds up the process of surveying abalone age
  • But a model that includes shucked weight, viscera weight and shell weight would still require killing the abalone

Model Selection: Context of Variables

  • So we decided to also make a “live” model, which uses variables that can be measured without killing the abalone
  • This will be compared with the full model later
| Original Dataset | Live Abalone Dataset |
|---|---|
| Sex | Sex |
| Length | Length |
| Diameter | Diameter |
| Height | Height |
| Whole_weight | Whole_weight |
| Shucked_weight | |
| Viscera_weight | |
| Shell_weight | |
| Rings | Rings |
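
A minimal sketch of how the live dataset could be derived from the original one, by dropping the three measurements that require killing the abalone. The names `LETHAL_COLUMNS` and `to_live_record` are hypothetical; the example record reuses the values from the data table.

```python
# Columns that can only be measured after shucking the abalone
LETHAL_COLUMNS = {"Shucked_weight", "Viscera_weight", "Shell_weight"}


def to_live_record(record: dict) -> dict:
    """Keep only the fields that can be measured on a live abalone."""
    return {name: value for name, value in record.items()
            if name not in LETHAL_COLUMNS}


full_record = {
    "Sex": "M", "Length": 0.455, "Diameter": 0.365, "Height": 0.095,
    "Whole_weight": 0.5140, "Shucked_weight": 0.2245,
    "Viscera_weight": 0.1010, "Shell_weight": 0.150, "Rings": 15,
}
live_record = to_live_record(full_record)
print(list(live_record))
```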

Data Distribution

  • There are no missing values, and the observations are treated as independent.
  • Rings is positively skewed
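
One way to quantify "positively skewed" is the moment-based sample skewness, which is greater than zero when the right tail is longer. The sketch below is an assumption about how this could be checked, not the method used for the slides, and the ring counts in it are made up for illustration.

```python
def sample_skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up ring counts with a long right tail (not taken from the dataset)
toy_rings = [7, 8, 9, 9, 10, 10, 11, 15, 20]
print(sample_skewness(toy_rings) > 0)
```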

Improving Linearity

The transformation applied to each variable is as follows (male is the baseline sex and is absorbed into \(\beta_0\)):

\(-(\frac{1}{Rings})^{\frac{1}{4}} = \beta_0 + \beta_F Sex[F] + \beta_I Sex[I] + \beta_1\log_{10}(Length) + \beta_2\log_{10}(Diameter) + \beta_3Height^{\left(\frac{1}{3}\right)}\)

\(+ \beta_4\log_{10}(Whole\ Weight) + \beta_5\log_{10}(Shucked\ Weight)\)

\(+ \beta_6\log_{10}(Viscera\ Weight) + \beta_7\log_{10}(Shell\ Weight) + \varepsilon\)
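
The transformations in the equation can be sketched as follows, together with the inverse needed to turn a fitted response back into a ring count. The function names are ours, and the sketch assumes every log-transformed measurement is strictly positive.

```python
import math


def transform_rings(rings: float) -> float:
    """Response used in the model: -(1 / Rings)^(1/4)."""
    return -((1.0 / rings) ** 0.25)


def inverse_transform_rings(response: float) -> float:
    """Undo the response transform: Rings = 1 / (-response)^4."""
    return 1.0 / ((-response) ** 4)


def transform_predictors(record: dict) -> dict:
    """Cube root for Height, log10 for every other continuous predictor."""
    transformed = {}
    for name, value in record.items():
        if name == "Height":
            transformed[name] = value ** (1.0 / 3.0)
        else:
            transformed[name] = math.log10(value)  # assumes value > 0
    return transformed


response = transform_rings(15)
print(response, inverse_transform_rings(response))
```

Any prediction on the transformed scale has to pass through `inverse_transform_rings` before it can be read as a number of rings.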

Correlation heat map

All variables are highly correlated with one another, with every pairwise coefficient of determination ≥ 0.6
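
For reference, a pairwise coefficient of determination is the square of the Pearson correlation, so a value of at least 0.6 corresponds to an absolute correlation of roughly 0.77 or more. The sketch below computes both for a made-up pair of variables; it is an illustration, not the code behind the heat map.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up measurements, for illustration only
toy_length = [1.0, 2.0, 3.0, 4.0]
toy_diameter = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(toy_length, toy_diameter)
print(r ** 2)            # pairwise coefficient of determination
print(math.sqrt(0.6))    # |r| needed for r^2 >= 0.6
```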

Model Selection: AIC Minimisation Approach

  • Penalise unnecessary variables
  • Form the best combination of variables to fit the model
  • The lower the AIC the better
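
For reference, the standard definition of AIC, with \(k\) the number of estimated parameters and \(\hat{L}\) the maximised likelihood of the model:

\(\mathrm{AIC} = 2k - 2\ln(\hat{L})\)

Each extra predictor adds 2 to the first term, so it only lowers the AIC if it raises the maximised log-likelihood by more than 1.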

Backward Approach

  • Start from a full model
  • Remove the least informative predictor, variable by variable, to minimise AIC

Forward Approach

  • Start from a null model
  • Add the most informative predictor, variable by variable, to minimise AIC
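
The two greedy searches can be sketched as below. This is a generic illustration, not the routine used to produce the slides: `aic` stands for any function that returns the AIC of a model fitted on a given subset of predictors, and `toy_aic` is a made-up stand-in so the example runs without fitting real models.

```python
def forward_select(candidates, aic):
    """Start from the null model; add the predictor that lowers AIC most."""
    selected = frozenset()
    best = aic(selected)
    remaining = set(candidates)
    while remaining:
        trial_aic, trial_var = min(
            (aic(selected | {c}), c) for c in sorted(remaining))
        if trial_aic >= best:  # no addition improves AIC: stop
            break
        selected = selected | {trial_var}
        remaining.remove(trial_var)
        best = trial_aic
    return selected


def backward_eliminate(candidates, aic):
    """Start from the full model; drop the predictor whose removal lowers AIC most."""
    selected = frozenset(candidates)
    best = aic(selected)
    while selected:
        trial_aic, trial_var = min(
            (aic(selected - {c}), c) for c in sorted(selected))
        if trial_aic >= best:  # no removal improves AIC: stop
            break
        selected = selected - {trial_var}
        best = trial_aic
    return selected


def toy_aic(subset):
    """Made-up AIC: each variable has a fixed gain, minus a penalty of 2 each."""
    gains = {"Length": 30.0, "Diameter": 20.0, "Noise": 0.5}
    return 100.0 - sum(gains[v] for v in subset) + 2 * len(subset)


variables = ["Length", "Diameter", "Noise"]
print(sorted(forward_select(variables, toy_aic)))
print(sorted(backward_eliminate(variables, toy_aic)))
```

In this toy setting both directions keep Length and Diameter and leave out Noise, whose gain is smaller than the penalty of 2.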

Model Selection: AIC Minimisation Approach

The selection process runs both a forward and a backward search on the scaled live and full (original) abalone datasets.

| Predictors | Backward Live: Estimate | Backward Live: p | Forward Live: Estimate | Forward Live: p | Backward Original: Estimate | Backward Original: p | Forward Original: Estimate | Forward Original: p |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | -0.22 | <0.001 | -0.22 | <0.001 | -0.21 | <0.001 | -0.21 | <0.001 |
| Sex [F] | 0.00 | 0.290 | 0.00 | 0.290 | -0.00 | 0.704 | -0.00 | 0.704 |
| Sex [I] | -0.00 | <0.001 | -0.00 | <0.001 | -0.00 | <0.001 | -0.00 | <0.001 |
| Length | -0.04 | <0.001 | -0.04 | <0.001 | -0.02 | 0.002 | -0.02 | 0.002 |
| Diameter | 0.06 | <0.001 | 0.06 | <0.001 | 0.02 | 0.004 | 0.02 | 0.004 |
| Height | 0.01 | <0.001 | 0.01 | <0.001 | 0.00 | <0.001 | 0.00 | <0.001 |
| Whole weight | 0.01 | <0.001 | 0.01 | <0.001 | 0.05 | <0.001 | 0.05 | <0.001 |
| Shucked weight | | | | | -0.05 | <0.001 | -0.05 | <0.001 |
| Viscera weight | | | | | -0.01 | <0.001 | -0.01 | <0.001 |
| Shell weight | | | | | 0.03 | <0.001 | 0.03 | <0.001 |
| Observations | 4177 | | 4177 | | 4177 | | 4177 | |
| R2 / R2 adjusted | 0.543 / 0.543 | | 0.543 / 0.543 | | 0.656 / 0.656 | | 0.656 / 0.656 | |
| AIC | -28130.663 | | -28130.663 | | -29313.170 | | -29313.170 | |
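
As a worked comparison of the two AIC values reported above (both models share the same response and the same 4177 observations, so the values are comparable):

\(\Delta\mathrm{AIC} = \mathrm{AIC}_{live} - \mathrm{AIC}_{original} = -28130.663 - (-29313.170) = 1182.507\)

The original model has the lower AIC, so on this criterion alone it is the better fit.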

Models Produced

Live Abalone Model

\(-(\frac{1}{Rings})^{\frac{1}{4}} = -0.2160858 + 0.0003333Sex[F] - 0.0029337Sex [I] -0.0437280\log_{10}(Length)\)

\(+ 0.0554263\log_{10}(Diameter)+ 0.0084491Height^{\left(\frac{1}{3}\right)} + 0.0104607\log_{10}(Whole\ Weight)\)


Original Abalone Model

\(-(\frac{1}{Rings})^{\frac{1}{4}} = -0.2072139 - 0.0001040Sex[F] - 0.0018633Sex[I] -0.0222445\log_{10}(Length)\)

\(+ 0.0192558\log_{10}(Diameter) + 0.0031554Height^{\left(\frac{1}{3}\right)} + 0.0466486\log_{10}(Whole\ Weight)\)

\(-0.0484394\log_{10}(Shucked\ Weight)- 0.0058606\log_{10}(Viscera\ Weight) + 0.0309122\log_{10}(Shell\ Weight)\)

Explaining the Live Model

\(-(\frac{1}{Rings})^{\frac{1}{4}} = -0.2160858 + 0.0003333Sex[F] - 0.0029337Sex [I] -0.0437280\log_{10}(Length)\)

\(+ 0.0554263\log_{10}(Diameter)+ 0.0084491Height^{\left(\frac{1}{3}\right)} + 0.0104607\log_{10}(Whole\ Weight)\)

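  • Male is the baseline sex and is absorbed into the intercept
  • Sex[F] and Sex[I] are indicators: 1 for the corresponding sex, 0 otherwise
  • We expect, holding all else constant:
    • For every 1 unit increase in \(\log_{10}(Length)\), the fourth root of inverse Rings, \((\frac{1}{Rings})^{\frac{1}{4}}\), increases by 0.0437280 (so the response, its negative, decreases by the same amount)
    • For every 1 unit increase in the cube root of Height, the fourth root of inverse Rings decreases by 0.0084491
    • The remaining coefficients are interpreted in the same way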
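
A 1 unit increase on the \(\log_{10}\) scale is a tenfold increase in Length, which is hard to picture for a real abalone. A minimal sketch of the same interpretation for a 10% increase, using the Length coefficient of the live model, is given below; the function name is ours. Because only the ratio of lengths enters, the result does not depend on the units of Length.

```python
import math

# Live model coefficient on log10(Length), copied from the equation above
BETA_LOG_LENGTH = -0.0437280


def response_change(beta: float, relative_increase: float) -> float:
    """Change in the response when a log10-transformed predictor grows
    by the given fraction (0.10 means +10%), all else held constant."""
    return beta * math.log10(1.0 + relative_increase)


delta_10_percent = response_change(BETA_LOG_LENGTH, 0.10)
delta_tenfold = response_change(BETA_LOG_LENGTH, 9.0)  # x10 = 1 log10 unit
print(delta_10_percent, delta_tenfold)
```

A tenfold increase recovers the coefficient itself, which matches the "1 unit increase" reading in the bullet points.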

Explaining the Original Model

\(-(\frac{1}{Rings})^{\frac{1}{4}} = -0.2072139 - 0.0001040Sex[F] - 0.0018633Sex[I] -0.0222445\log_{10}(Length)\)

\(+ 0.0192558\log_{10}(Diameter) + 0.0031554Height^{\left(\frac{1}{3}\right)} + 0.0466486\log_{10}(Whole\ Weight)\)

\(-0.0484394\log_{10}(Shucked\ Weight)- 0.0058606\log_{10}(Viscera\ Weight) + 0.0309122\log_{10}(Shell\ Weight)\)

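  • Male is the baseline sex and is absorbed into the intercept
  • Sex[F] and Sex[I] are indicators: 1 for the corresponding sex, 0 otherwise
  • We expect, holding all else constant:
    • For every 1 unit increase in \(\log_{10}(Length)\), the fourth root of inverse Rings, \((\frac{1}{Rings})^{\frac{1}{4}}\), increases by 0.0222445 (so the response, its negative, decreases by the same amount)
    • For every 1 unit increase in the cube root of Height, the fourth root of inverse Rings decreases by 0.0031554
    • The remaining coefficients are interpreted in the same way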

Model Assumption Checking

Live Abalone Model (Scaled)

  • Linearity: Residuals are approximately symmetrically distributed above and below zero.
  • Normality: Residuals are approximately normally distributed, since most of the points lie along the normal line.
  • Homoscedasticity: Residuals are scattered evenly around the zero line with fairly constant variance.

Model Assumption Checking

Original Abalone Model (Scaled)

  • Linearity: Residuals are approximately symmetrically distributed above and below zero.
  • Normality: Residuals are approximately normally distributed, since most of the points lie along the normal line.
  • Homoscedasticity: Residuals are scattered evenly around the zero line with fairly constant variance.

Performance Assessment - Cross Validation

Live abalone data (scaled)

| Rsquared | RMSE | MAE |
|---|---|---|
| 0.541 | 0.008 | 0.007 |

Raw live abalone data (no scaling)

| Rsquared | RMSE | MAE |
|---|---|---|
| 0.364 | 520.228 | 365.98 |

Original data (scaled)

| Rsquared | RMSE | MAE |
|---|---|---|
| 0.654 | 0.007 | 0.006 |

Raw Original data (no scaling)

| Rsquared | RMSE | MAE |
|---|---|---|
| 0.53 | 442.57 | 316.687 |
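
The three metrics and the fold splitting behind a k-fold cross validation can be sketched as follows. This is an illustration under assumptions: the slides do not state the number of folds or which definition of R squared was used, so the sketch uses \(1 - SS_{res}/SS_{tot}\), and the values in it are made up. In practice the rows would be shuffled before splitting.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot (one common definition; an assumption here)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds of range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Made-up values, for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(rmse(actual, predicted), mae(actual, predicted))
folds = list(k_fold_indices(10, 3))
```

Each model would be refitted on the `train` rows of every fold and scored on the `test` rows, with the metrics averaged over the folds.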

Implications of the Original Model

  • Performs better (cross-validated R squared of about 0.65 versus 0.54 for the live model), but is environmentally damaging because the abalone must be killed
  • More suitable if abalone are already intended for harvest, but can’t be used for some research applications (e.g. endangered species)


Implications of the Live Model

  • Non-invasive way of estimating abalone age, with only a small penalty in accuracy
  • More socially acceptable and is more versatile for monitoring abalone populations
  • However, additional effort is required to properly return the abalone to their habitats


Comparison of Models

  • Since the live model offers similar accuracy while preserving the abalone, it is our preferred model
  • However, care should be taken when collecting data for this model to ensure that the abalone aren’t harmed
    • Abalone are haemophiliacs: their blood cannot clot

Live Model (scaled)

| Rsquared | RMSE | MAE |
|---|---|---|
| 0.54 | 0.008 | 0.007 |

Original Model (scaled)

| Rsquared | RMSE | MAE |
|---|---|---|
| 0.652 | 0.007 | 0.006 |

Limitations

  • Interpretability of the model is relatively low due to scaling
  • Model is sensitive to rounding error because the coefficients are small
  • May not apply to some species of abalone
  • AIC minimisation also has some limitations (Hurvich & Tsai, 1989)


Hurvich, C. M., & Tsai, C. (1989). Regression and time series model selection in small samples. Biometrika, 76(2), 297–307. https://doi.org/10.1093/biomet/76.2.297

Future Directions and Conclusion

  • Overall, we found that the live, non-invasive model was preferred
  • In the future, we could test the model on other species of abalone or on abalone from other parts of the world
  • We will also source stakeholder investment to help improve our model
